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Abstract 

Evolvable hardware combines the powerful search capability of evolutionary al- 
gorithms with the flexibility of reprogrammable devices, thereby providing a natural 
framework for reconfiguration. This framework has generated an interest in using 
evolvable hardware for fault-tolerant systems because reconfiguration can effectively 
deal with hardware faults whenever it is impossible to provide spares. But systems 
cannot tolerate faults indefinitely, which means reconfiguration does have a deadline. 
The focus of previous evolvable hardware research relating to fault-tolerance has been 
primarily restricted to restoring functionality, with no real consideration of time con- 
straints. In this paper we are concerned with evolvable hardware performing reconfig- 
uration under deadline constraints. In particular, we investigate reconfigurable hard- 
ware that undergoes intrinsic evolution. We show that fault recovery done by intrinsic 
reconfiguration has some restrictions, which designers cannot ignore. 

1 Introduction 

Reliable systems can be depended on to provide continual service. Unfortunately, faults are 
inevitable, which leads to disruptions in service. One way of increasing a system's availability 
is to make it fault-tolerant — i.e., capable of detecting and recovering from hardware faults. 
Exchanging a faulty component with an operating spare is the most widely used method 
for hardware fault recovery fT], but it is not always possible to have redundant hardware 
available. For instance, the very restrictive space and weight requirements typically found 
on spacecraft makes it difficult to find room for spares. Reconfiguring a faulty system elimi- 
nates the need for redundant hardware, although reconfiguration does not always guarantee 
full functionality can be restored. Nevertheless, reconfiguration is a viable fault recovery 
technique for any system with limited free space. 
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Evolvable hardware (EHW) has emerged as a powerful method for doing original hard- 
ware design — which naturally suggests it could be equally useful for doing hardware recon- 
figuration. The idea behind EHW is to combine the biologically-inspired search methods 
of evolutionary algorithms with the flexibility of reconfigurable hardware. The evolutionary 
algorithm searches throughout the space of all possible configurations looking for the one 
that performs the best. Every configuration the evolutionary algorithm finds must be eval- 
uated and there are two accepted methods: extrinsic evolution where the evaluation is done 
in software, and intrinsic evolution where the evaluation is done on a hardware implementa- 
tion. In many instances intrinsic evolution is necessary because the only real way to evaluate 
a configuration is to implement it and have it actually operate in the physical environment. 
We refer to a reconfiguration search using intrinsic evolution as intrinsic reconfiguration. 
(An excellent introduction to EHW can be found in [2].) 

Two types of devices are suitable reconfigurable architectures for analog systems: the field 
programmable analog array (FPAA) and the field programmable transistor array (FPTA). 
These devices represent different levels of granularity. The FPAA provides configurable 
blocks of circuitry along with programmable routing resources. Conversely, the FPTA con- 
sists of an array of MOSFET transistors interconnected via programmable switches. A small 
number of capacitors are included on-chip, but resistors are synthesized using the MOSFET 
transistors. FPAAs are available as commercial off-the-shelf (COTS) devices (e.g., see |3J); 
the FPTA was fabricated for NASA's Jet Propulsion Laboratory and is currently only avail- 
able for research studies 

Some recent work has shown EHW can be quite effective for reconfiguring existing hard- 
ware to overcome faults [SlinilZI- The ability of evolutionary algorithms to find good recon- 
figurations is not at issue in this paper. We instead are concerned with the issue of time. 
Most EHW-based studies rely on device simulators rather than physical hardware. This 
simulator use means the time to download a configuration, the time to program the device, 
and the time to conduct a fitness evaluation on the reconfigured hardware has largely been 
ignored — even though they dramatically affect the evolutionary algorithm's running time. 

Systems cannot operate indefinitely with faults. This means fault recovery must be com- 
pleted within specified timeframes or undesirable consequences will occur. The introduction 
of a deadline into fault recovery means fault-tolerant systems must be considered real-time 
systems (RTS). Greenwood, et. al jH] were the first to suggest EHW-based reconfiguration 
has a time limit. In this paper it will be shown how reconfiguration time impacts the choice 
of a fault recovery mechanism. 

2 Preliminary Definitions 

This section provides a brief introduction to real-time systems. Interested readers are referred 
to some of the excellent books on this topic for further information (e.g., see P]). We begin 
with a formal definition of a real-time system. 

Definition: (real-time system) 
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Any system that is both logically and temporally correct 

Logical correctness means the system satisfies all functional specifications. Temporal 
correctness means the system is guaranteed to perform these functions within explicit time- 
frames. Fault-tolerant systems qualify as real-time systems because fault detection and fault 
recovery inherently have deadlines. That is, the fault must be detected within a certain 
period of time after it occurs, and the fault must be corrected within a certain period of 
time after it is detected. Fault recovery may also have an expected start time. 

The notion of real-time is often interpreted to mean really fast. This interpretation is 
not correct. Real-time does not necessarily mean fast — and fast does not necessarily mean 
real-time. Suppose a document must be sent from Chicago to London, and two delivery 
systems are available: surface mail with a guaranteed 3-day delivery time or e-mail with a 
guaranteed 5 minute delivery time. The e-mail delivery is orders of magnitude faster than 
surface mail, but that does not necessarily mean it qualifies as a real-time delivery system. 
It is the required delivery deadline that ultimately establishes whether the real-time system 
definition has been met. For example, both systems are real-time systems if the deadline 
is six days because both are logically and temporally correct. However, neither one is a 
real-time system if the deadline is three minutes because neither one is temporally correct. 

Real-time systems are classified as hard or soft. Hard systems have catastrophic con- 
sequences if the temporal requirements are not met — up to and including complete system 
destruction. In fact, if the hard system is safety-critical, failure could lead to injury or even 
death. Conversely, soft systems only have degraded performance if the temporal require- 
ments aren't met. The classification of a fault-tolerant system, in particular, depends on the 
nature of the faults and the consequences for failing to detect and correct them in a timely 
manner. Suppose a fault results in an over-temperature condition. If the system hardware 
can survive this condition for up to five minutes, then fault recovery must be completed 
within five minutes to prevent further damage. This would be a hard fault-tolerant system. 
On the other hand, if the fault only causes a minor loss of some sensor data, fault recovery 
could take considerably longer without dire consequences. This would be a soft fault-tolerant 
system. 

3 Quantifying Reconfiguration Time 

The main parameter we concentrate on is the programming time (tp) for reconfigurable analog 
devices. Table ^ shows the programming time (tp) for several reconfigurable devices. This 
programming time cannot be ignored because EHW algorithms frequently have populations 
sizes in the hundreds and they run for thousands of generations. 

Example 1: 

Suppose a proportional-derivative controller is implemented in an FPAA. A controller's 
fitness is found by applying a step input to the control system and then measuring its settling 
time. The fitness evaluation lasts at least as long as the settling time does, which can be 
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Device 


Type 


Size 


Mfg 


tp (ms) 


Ref. 


Notes 


ispPAClO 


FPAA 


4 


Lattice Semiconductor 


100 




10 






AN220E04 


FPAA 


4 


Anadigm 


3.8 
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1,2 


JPL's FPTA2 


FPTA 


64 


fabricated by MOSIS 


0.008 
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(1) All 18 banks are reloaded with 256 bytes/bank 

(2) Serial transfer with 10 MHz clock 

(3) Byte-wide transfers with 160 MHz clock 



Table 1: Programming times for various popular reconfigurable analog devices. All are 
COTS devices except for the FPTA. The units for size are modules for FPAAs and cells for 
FPTAs. The references indicate where the tp value is documented. 

somewhat lengthy. Indeed, settling times of two minutes are not unheard of 13 . Under 
these circumstances, it wouldn't take a very large population size nor a large number of 
generations to make an intrinsic reconfiguration run for hours or even days before finishing. 

Example 2: 

An AN220E04 FPAA is used to compensate for aging effects in a control system responsi- 
ble for positioning a satellite's communications antenna. The reconfiguration search is done 
by a generational GA run for 500 generations with a population size of 100. The system's 
step response is measured to determine if the compensation is correct. This step response 
test takes 625 milliseconds to conduct. It takes 3.8 + 625 = 628.8 ms to reprogram the 
FPAA and compute the evolved compensator's fitness, but a total of 500,000 compensators 
are evolved during the evolutionary algorithm's run. Hence, the reconfiguration time takes 
~ 8.7 hours. 

4 Discussion 

Reconfiguration times are meaningless until they are put into context. For instance, take 
Example 2 from the previous section. Suppose brief communication sessions with the satellite 
are scheduled at 10 hour intervals. A session may be skipped, but skipping two sessions in a 
row is not permitted. If a fault is detected just prior to a scheduled session, and if the error 
results in missing the session, then the fault recovery deadline is 10 hours. This is the worst 
case scenario^ . An almost 9 hour reconfiguration time may seem quite long, but in this case 
it is perfectly acceptable because < 10. On the other hand, it would not be acceptable if 
communication sessions were scheduled at 6 hour intervals. 

The only way to determine if there is a problem is to compare the reconfiguration time 
against the fault recovery deadline. This latter quantity is system dependent. No problem 
exists so long as the reconfiguration time is less than the recovery deadline. 

^ Missing one session is permitted. If the fault is detected just after a scheduled session, the fault recovery 
deadline would be 20 hours. 
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This time comparison adds a new perspective on intrinsic evolution and, at the same time, 
imposes a new requirement. Reconfiguration becomes a real-time process whenever it is used 
as a fault recovery method. Consequently, it is no longer sufficient to just talk about how an 
evolutionary algorithm was able to restore a circuit's functionality. These statements may 
show logical correctness, but without comparing the reconfiguration time against a deadline 
there is no proof of temporal correctness. Just reporting an algorithm's running time doesn't 
say anything about temporal correctness either. The key point is expressed by the following 
first principle: 

No EHW-based recovery method can legitimately proclaim efficacy until it is 
proven to be both logically and temporally correct. 

The validity of this principle is easy to see. If the recovery method isn't logically correct, 
then the problem can't be fixed. If it isn't temporally correct, then the problem can't be 
fixed soon enough to prevent other things from going wrong. Without proving logical and 
temporal correctness, there is no basis for claiming a fault recovery method is effective. 

It is easy to prove if a fault recovery method is logically correct — try it and see if it 
ffxes the problem. Proving temporal correctness, however, is more complicated because it 
really depends on conducting a thorough failure modes and effects analysis (FMEA). This 
analysis should identify all potential faults and their effects on system performance ^1] . One 
outcome of a FMEA are the recovery deadlines. Temporal correctness is proven if a logically 
correct recovery is guaranteed to finish prior to the recovery deadline. 

Greenwood, et. al jB] suggested evolutionary algorithms designed for reconfiguration 
searches perform best if they have high selection pressure and if they emphasize muta- 
tion for reproduction. In principle, any type of evolutionary algorithm could be used for 
a reconfiguration search, but from a practical standpoint genetic programming algorithms 
should be avoided. Genetic programming algorithms designed for EHW problems are put 
on large multiprocessor systems to abridge their long running time ^SjEj. This becomes 
especially problematic for fault-tolerant systems because, if there isn't enough room for re- 
dundant hardware, then there isn't enough room for a large multiprocessor system either. 
It seems unlikely a full-fiedged genetic programming search, run on a single processor, could 
finish quickly enough to meet a fault recovery deadline of only a few hours. 

5 Conclusions 

EHW-based reconfiguration is a viable method of performing fault recovery in systems with- 
out redundant hardware. Fault-tolerant systems are real-time systems. Consequently, any 
attempts to intrinsically evolve a new hardware configuration must consider the device pro- 
gramming time and the fitness evaluation time because they both contribute to the recon- 
figuration time. 

It has been shown neither a large population size nor thousands of generations are neces- 
sary to have reconfiguration searches with surprisingly long finishing times. However, a long 
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search time by itself is not enough to reject reconfiguration as a fault recovery method. Intrin- 
sic reconfiguration can be used for fault recovery so long as it finishes before the mandatory 
recovery deadline. 
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